An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes
نویسنده
چکیده
We develop in this article the first actor–critic reinforcement learning algorithm with function approximation for a problem of control under multiple inequality constraints. We consider the infinite horizon discounted cost framework in which both the objective and the constraint functions are suitable expected policy-dependent discounted sums of certain sample path functions. We apply the Lagrange multiplier method to handle the inequality constraints. Our algorithm makes use of multi-timescale stochastic approximation and incorporates a temporal difference (TD) critic and an actor that makes a gradient search in the space of policy parameters using efficient simultaneous perturbation stochastic approximation (SPSA) gradient estimates. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal policy.
منابع مشابه
A constrained optimization perspective on actor critic algorithms and application to network routing
We propose a novel actor-critic algorithm with guaranteed convergence to an optimal policy for a discounted reward Markov decision process. The actor incorporates a descent direction that is motivated by the solution of a certain non-linear optimization problem. We also discuss an extension to incorporate function approximation and demonstrate the practicality of our algorithms on a network rou...
متن کاملA Function Approximation Approach to Estimation of Policy Gradient for POMDP with Structured Policies
We consider the estimation of the policy gradient in partially observable Markov decision processes (POMDP) with a special class of structured policies that are finite-state controllers. We show that the gradient estimation can be done in the Actor-Critic framework, by making the critic compute a “value” function that does not depend on the states of POMDP. This function is the conditional mean...
متن کاملVariance Adjusted Actor Critic Algorithms
We present an actor-critic framework for MDPs where the objective is the variance-adjusted expected return. Our critic uses linear function approximation, and we extend the concept of compatible features to the variance-adjusted setting. We present an episodic actor-critic algorithm and show that it converges almost surely to a locally optimal point of the objective function. Index Terms Reinfo...
متن کاملHow to Rein in the Volatile Actor: A New Bounded Perspective
Actor-critic algorithms are amongst the most well-studied reinforcement learning algorithms that can be used to solve Markov decision processes (MDPs) via simulation. Unfortunately, the parameters of the so-called “actor” in the classical actor-critic algorithm exhibit great volatility — getting unbounded in practice, whence they have to be artificially constrained to obtain solutions in practi...
متن کاملOnActor-Critic Algorithms
In this article, we propose and analyze a class of actor-critic algorithms. These are two-time-scale algorithms in which the critic uses temporal difference learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction, based on information provided by the critic. We show that the features for the critic should ideally span a su...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Systems & Control Letters
دوره 59 شماره
صفحات -
تاریخ انتشار 2010